Optimize HNSW Memory Allocation with Chunked Arrays #633
base: develop
Conversation
PingXie left a comment:
LGTM overall. Thanks for upstreaming this change, Michael!
LGTM! In case not already done, it would be useful to run a sanity test with an ASAN build.

@yairgott: ASAN tests are already being run in GitHub Actions.
Here are some performance test results. Used ann-benchmarks at commit f402b2cc17b980d7cd45241ab5a7a4cc0f965e55 with the following patch applied: https://gist.githubusercontent.com/michaelbautin/fe38704f9678532ee42d0788b0fc9ccb/raw

Used the following script to alternate between the unmodified and modified versions of hnswlib: https://gist.githubusercontent.com/michaelbautin/b7b6264f899e835d2da06d641c1e05f8/raw

The modified version of hnswlib (

Tested on a 96-vCPU GCE VM with an Intel Xeon Platinum 8581C CPU @ 2.30 GHz. The index build times did not change significantly and were in the range of 421-424 seconds for both versions of hnswlib on the deep-image-96-angular ann-benchmarks dataset.

According to a two-sided t-test computed using the
Tests using baseline (2fba7fb):

With the changes:
Pull request overview
This PR optimizes HNSW index memory allocation by replacing expensive realloc calls with a chunked array strategy. The changes significantly reduce memory consumption (eliminating temporary memory doubling during resize) and improve performance (avoiding costly memcpy operations) for large-scale indices.
Key Changes:
- Introduced `ChunkedArray` template class for managing large array-like data with fixed-size memory chunks
- Converted `data_level0_memory_` and `linkLists_` from raw pointers to `ChunkedArray` instances
- Replaced conditional prefetch macros with a unified `HNSWLIB_MM_PREFETCH` macro with bounds checking
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| hnswlib/hnswlib.h | Adds ChunkedArray template class and HNSWLIB_MM_PREFETCH macro for chunked memory management |
| hnswlib/hnswalg.h | Converts HNSW data structures to use ChunkedArray and updates prefetch calls with bounds checking |
| python_bindings/bindings.cpp | Updates Python bindings to use new copyTo/copyFrom methods and getLinkListPtr/setLinkListPtr accessors |
| CMakeLists.txt | Removes blank line (formatting) |
```cpp
#include <assert.h>
// ...
#include <memory>
#include <type_traits>
```
Copilot AI commented on Dec 7, 2025:
The `ChunkedArray` class uses `std::deque` on line 556, but the header `<deque>` is not included. This will cause a compilation error. Add `#include <deque>` to the includes section near the top of the file.
```diff
 #include <type_traits>
+#include <deque>
```
```cpp
HNSWLIB_MM_PREFETCH((char *) (visited_array + *(data + 1)), _MM_HINT_T0);
HNSWLIB_MM_PREFETCH((char *) (visited_array + *(data + 1) + kCacheLineSize), _MM_HINT_T0);
HNSWLIB_MM_PREFETCH(getDataByInternalId(*datal), _MM_HINT_T0);
HNSWLIB_MM_PREFETCH(getDataByInternalId(*(datal + 1)), _MM_HINT_T0);
```
Copilot AI commented on Dec 7, 2025:
Prefetching `*(datal + 1)` without checking if `size >= 2` could lead to out-of-bounds memory access. If `size` is 0 or 1, this prefetch will read beyond the valid data. Consider adding a bounds check: `if (size >= 2) { HNSWLIB_MM_PREFETCH(...) }`.
```diff
-HNSWLIB_MM_PREFETCH(getDataByInternalId(*(datal + 1)), _MM_HINT_T0);
+if (size >= 2) {
+    HNSWLIB_MM_PREFETCH(getDataByInternalId(*(datal + 1)), _MM_HINT_T0);
+}
```
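The guarded-prefetch pattern can be sketched in isolation. This is a hypothetical stand-in, not hnswlib's actual macro: `__builtin_prefetch` (GCC/Clang) replaces the real `HNSWLIB_MM_PREFETCH`, and the `_MM_HINT_T0` argument is dropped for simplicity. Since a prefetch is only a hint, the bounds check matters solely to avoid computing and touching an out-of-range address.

```cpp
#include <cassert>
#include <cstddef>

// Stand-in for hnswlib's HNSWLIB_MM_PREFETCH (assumption: a plain prefetch
// hint with no semantic effect on the computation).
#if defined(__GNUC__)
#define HNSWLIB_MM_PREFETCH(addr) __builtin_prefetch((addr))
#else
#define HNSWLIB_MM_PREFETCH(addr) ((void)0)
#endif

// Sum neighbor ids, prefetching one element ahead only when it exists.
long sum_neighbors(const unsigned* datal, size_t size) {
    long total = 0;
    for (size_t j = 0; j < size; ++j) {
        if (j + 1 < size) {  // bounds check before prefetching the next slot
            HNSWLIB_MM_PREFETCH((const char*)(datal + j + 1));
        }
        total += datal[j];
    }
    return total;
}
```

With the check in place, `size == 0` and `size == 1` never dereference or address past the end of the neighbor list.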
```cpp
size_t cur_size = std::min(bytes_per_chunk, num_bytes);
memcpy(destination, chunks_[chunk_index].get(), cur_size);
num_bytes -= cur_size;
destination += cur_size;
```
Copilot AI commented on Dec 7, 2025:
The `copyTo` method doesn't increment `chunk_index` in the loop, which means it will always copy from chunk 0 repeatedly. This will result in incorrect data being copied. The loop should increment `chunk_index` after each iteration: `chunk_index++`.
```diff
 destination += cur_size;
+ ++chunk_index;
```
```cpp
void copyFrom(const char* source, size_t num_bytes) {
    size_t chunk_index = 0;
    size_t bytes_per_chunk = getSizePerChunk();
    while (num_bytes > 0) {
        size_t cur_size = std::min(bytes_per_chunk, num_bytes);
        memcpy(chunks_[chunk_index].get(), source, cur_size);
        num_bytes -= cur_size;
        source += cur_size;
    }
}
```
Copilot AI commented on Dec 7, 2025:
The `copyFrom` method doesn't increment `chunk_index` in the loop, which means it will always copy to chunk 0 repeatedly. This will result in data corruption. The loop should increment `chunk_index` after each iteration: `chunk_index++`.
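Both copy loops need the same one-line fix. A minimal standalone sketch (a hypothetical `TinyChunked`, not hnswlib's actual class) shows the corrected pattern: without the increment, every iteration would read from or write to chunk 0.

```cpp
#include <algorithm>
#include <cassert>
#include <cstring>
#include <vector>

// Simplified chunked buffer: fixed-size chunks, byte-wise copy in/out.
struct TinyChunked {
    size_t bytes_per_chunk;
    std::vector<std::vector<char>> chunks;

    void copyFrom(const char* source, size_t num_bytes) {
        size_t chunk_index = 0;
        while (num_bytes > 0) {
            size_t cur = std::min(bytes_per_chunk, num_bytes);
            std::memcpy(chunks[chunk_index].data(), source, cur);
            num_bytes -= cur;
            source += cur;
            ++chunk_index;  // the fix: advance to the next chunk
        }
    }

    void copyTo(char* destination, size_t num_bytes) const {
        size_t cur, chunk_index = 0;
        while (num_bytes > 0) {
            cur = std::min(bytes_per_chunk, num_bytes);
            std::memcpy(destination, chunks[chunk_index].data(), cur);
            num_bytes -= cur;
            destination += cur;
            ++chunk_index;  // same fix on the read side
        }
    }
};
```

A round trip through chunks of 3 bytes then reproduces the original buffer byte for byte.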
```diff
 linklistsizeint *get_linklist0(tableint internal_id, char *data_level0_memory_) const {
-    return (linklistsizeint *) (data_level0_memory_ + internal_id * size_data_per_element_ + offsetLevel0_);
+    return (linklistsizeint *) (data_level0_memory_[internal_id] + offsetLevel0_);
 }
```
Copilot AI commented on Dec 7, 2025:
The parameter `data_level0_memory_` shadows the member variable with the same name but is never used in the function body. The function directly accesses `data_level0_memory_[internal_id]`, which refers to the member variable. This unused parameter should either be removed or the function should use it instead of the member variable.
```cpp
ChunkedArray(const ChunkedArray& other) = delete;
ChunkedArray& operator=(const ChunkedArray& other) = delete;

ChunkedArray(ChunkedArray&& other) noexcept {
```
Copilot AI commented on Dec 7, 2025:
The move constructor doesn't initialize member variables before calling `swap`. This means it's swapping with uninitialized values. The member variables should be initialized (e.g., using the default constructor's member initializer list) before calling `swap(other)`, or the swap logic should be replaced with direct member moves.
```diff
-ChunkedArray(ChunkedArray&& other) noexcept {
+ChunkedArray(ChunkedArray&& other) noexcept
+    : element_byte_size_(0),
+      elements_per_chunk_(0),
+      element_count_(0),
+      chunk_padding_bytes_(0) {
```
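The swap-based move pattern the suggestion describes can be sketched with a toy class (hypothetical names; it does not mirror hnswlib's exact member layout). The point is that built-in members like `size_t` counters are left indeterminate unless the initializer list zeroes them, while class-type members such as a `std::vector` are safely default-constructed either way; after the swap, the moved-from object is left in a defined empty state.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

class ChunkedArraySketch {
 public:
    ChunkedArraySketch() = default;
    explicit ChunkedArraySketch(size_t n) : element_count_(n), chunks_(n) {}

    // Value-initialize the scalar member BEFORE swapping, so 'other' ends up
    // holding defined (empty) state rather than garbage.
    ChunkedArraySketch(ChunkedArraySketch&& other) noexcept
        : element_count_(0) {
        swap(other);
    }

    void swap(ChunkedArraySketch& other) noexcept {
        std::swap(element_count_, other.element_count_);
        chunks_.swap(other.chunks_);
    }

    size_t size() const { return element_count_; }

 private:
    size_t element_count_ = 0;          // scalar: needs explicit init
    std::vector<int> chunks_;           // class type: default-constructs
};
```

Moving from an object leaves it reporting size 0, as the swap with a freshly initialized object implies.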
```cpp
MallocUniqueCharArrayPtr makeUniqueCharArray(size_t n_bytes) {
    char* raw_ptr = static_cast<char*>(malloc(n_bytes));
    return MallocUniqueCharArrayPtr(raw_ptr);
}
```
Copilot AI commented on Dec 7, 2025:
The `makeUniqueCharArray` function doesn't check whether `malloc` returns `nullptr` before wrapping it in a `unique_ptr`. If memory allocation fails, this will create a `unique_ptr` containing `nullptr`, which could lead to crashes when dereferenced. Consider checking for allocation failure and throwing an exception or returning an error status.
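One way to apply the review's suggestion is to fail loudly with `std::bad_alloc`. The sketch below assumes a `free`-based deleter, since the array is `malloc`-allocated; the deleter type and alias are hypothetical stand-ins for whatever hnswlib actually defines.

```cpp
#include <cstddef>
#include <cstdlib>
#include <memory>
#include <new>

// Assumed deleter: the buffer comes from malloc, so it must go back to free.
struct FreeDeleter {
    void operator()(char* p) const { std::free(p); }
};
using MallocUniqueCharArrayPtr = std::unique_ptr<char[], FreeDeleter>;

MallocUniqueCharArrayPtr makeUniqueCharArray(size_t n_bytes) {
    char* raw_ptr = static_cast<char*>(std::malloc(n_bytes));
    if (raw_ptr == nullptr) {
        throw std::bad_alloc();  // surface allocation failure immediately
    }
    return MallocUniqueCharArrayPtr(raw_ptr);
}
```

Throwing here matches the behavior callers already expect from `new`, so allocation failure cannot silently propagate as a null pointer.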
Original patch was created by @PingXie (reviewed by @yairgott). These changes originated in the valkey-search project.
Original commit message below:
This change improves the memory allocation strategy of the HNSW index data structures by adopting chunked arrays. Previously, resizing the HNSW index involved expensive realloc calls, which were particularly costly in two ways:
Memory Consumption: During resizing, realloc temporarily requires double the memory to accommodate the old data block until the new block is ready and the old block is freed. This posed a substantial issue when dealing with gigabytes of data, leading to excessive memory usage.
Performance Overhead: The realloc operation entails copying data from the old block to the new block, invoking an expensive memcpy operation. This process becomes increasingly burdensome as the data size grows, resulting in significant performance degradation for large-scale indices.
By transitioning to chunked arrays for memory allocation, we circumvent the need for realloc calls during resizing. This approach not only reduces the memory footprint by eliminating the temporary doubling of memory but also enhances overall performance by avoiding the costly memcpy operations. The chunked array strategy allocates new memory in smaller, manageable chunks as the data grows, ensuring more efficient use of memory and improving the scalability of HNSW index operations, especially critical for indices spanning multiple gigabytes.
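The strategy described above can be illustrated with a minimal sketch (a simplified toy, not hnswlib's `ChunkedArray`): growth appends whole chunks, so existing bytes are never moved, no `realloc` or bulk `memcpy` occurs, and addresses handed out before a resize stay valid afterwards.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Toy chunked byte store: capacity grows by whole fixed-size chunks.
class ChunkedBytes {
 public:
    explicit ChunkedBytes(size_t bytes_per_chunk)
        : bytes_per_chunk_(bytes_per_chunk) {}

    // Grow capacity to at least n bytes by appending chunks; existing
    // chunks (and pointers into them) are untouched.
    void resize(size_t n) {
        size_t needed = (n + bytes_per_chunk_ - 1) / bytes_per_chunk_;
        while (chunks_.size() < needed) {
            chunks_.push_back(std::make_unique<char[]>(bytes_per_chunk_));
        }
    }

    // Address of byte i; stable across later resize() calls.
    char* at(size_t i) {
        return chunks_[i / bytes_per_chunk_].get() + i % bytes_per_chunk_;
    }

 private:
    size_t bytes_per_chunk_;
    std::vector<std::unique_ptr<char[]>> chunks_;
};
```

Pointer stability is the key contrast with `realloc`, which may relocate the entire block and therefore both doubles peak memory and invalidates every outstanding pointer into the index.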